Fix(consolidation): populate search_vector on observation INSERT/UPDATE in consolidator#2425
Merged
nicoloboschi merged 5 commits intoJun 30, 2026
Conversation
The consolidator creates and updates observations without populating the search_vector tsvector column. Under the native text search extension, this means observations are invisible to BM25 full-text retrieval - the BM25 arm returns 0 candidates regardless of query content. Four code paths write observation text to memory_units: 1. _dedup_reconcile_create (merge into existing twin) 2. _dedup_reconcile_update (drift-merge into different twin) 3. _execute_update_action (LLM rewrite of existing observation) 4. _create_observation_directly (new observation INSERT) None populated search_vector. This patch adds conditional tsvector generation gated on config.text_search_extension == 'native', matching the existing pattern in ops_postgresql.insert_facts_batch. Non-native backends (pg_textsearch, pgroonga, pg_search) continue to leave search_vector NULL as they index base text columns directly. The INSERT path (Site 4) splits the existing else branch into an explicit elif/else to avoid applying native tsvector logic to backends that don't use it. Fixes: observations invisible to BM25 retrieval arm.
Add test for observation creation with native search vector
Contributor
Author
|
Deployment note: Existing observations created before this fix still have UPDATE memory_units
SET search_vector = to_tsvector('english'::regconfig, COALESCE(text, ''))
WHERE fact_type = 'observation' AND search_vector IS NULL;Suggest including this in the release notes. Note: the For external PG instances this is a manual step; for the default baked-in pg0 deployment it could potentially be an Alembic migration, but that's a separate decision for maintainers. |
…ations The writer fix only populates search_vector for observations created or updated after deploy. Observations already written under the native backend keep a NULL search_vector and stay invisible to BM25 until re-consolidated. Add migration c3f7a1b9d2e4 to backfill them, gated on the native tsvector column type and scoped to fact_type='observation' with a NULL search_vector (idempotent). Matches the writer's text-only tsvector and the configured native language.
post-checkout/post-commit/post-merge/pre-push were git-lfs stubs picked up from the contributor's local hookspath and committed by mistake. They are unrelated to this change; the project's real .githooks/pre-commit is left intact.
nicoloboschi
added a commit
that referenced
this pull request
Jul 1, 2026
…nfig, trace recorder leak) (#2491) * test(consolidation): fix dedup merge-path tests missing text-search config The dedup merge/update path builds a search_vector UPDATE clause from config.text_search_extension (+ _native_language) since #2425, but the _dedup_reconcile_create / _dedup_reconcile_update test configs only set consolidation_dedup_threshold, so the two merge-path tests raised AttributeError: 'types.SimpleNamespace' object has no attribute 'text_search_extension' on main. Add the two fields (production defaults native/english) to those configs. The clause reuses $1, so the existing positional-arg assertions are unchanged. * test(fact-extraction): pass Vertex AI settings when building LLMConfig Regression: LLMConfig was refactored to use vertexai_project_id/region/ service_account_key as-passed (the caller resolves the global-config fallback), but the llm_config fixture never forwarded them. So with the CI provider set to vertexai, LLMConfig raised "HINDSIGHT_API_LLM_VERTEXAI_PROJECT_ID is required" even though the env var was set — the test errored in the test-api job. Forward the three Vertex settings from config (mirroring MemoryEngine's own LLMConfig construction). Verified: LLMConfig(provider="vertexai", ...) now constructs with project_id passed, and still raises when it is omitted. * test(llm-provider): forward provider-specific settings in _make_llm _make_llm() built an LLMProvider from the env-selected provider without forwarding provider-specific settings, so the vertexai and litellmrouter acceptance-matrix jobs failed at construction ("VERTEXAI_PROJECT_ID is required" / "litellmrouter requires a config object"). LLMProvider uses these as-passed (it does not resolve them from global config), so forward vertexai_project_id/region/service_account_key and litellmrouter_config. * test(llm-trace): drop leaked span recorders after each test (#2229) Root cause of the flaky test_llm_trace::test_disabled_writes_no_rows: MemoryEngine.__init__ registers its LLM-trace recorder in a process-global registry, and only close() removes it. Tests that construct an engine directly (test_per_operation_llm_config, test_llm_reasoning_effort_env, etc.) never close it, leaking an ENABLED recorder. Locally those recorders' writes fail (uninitialized backend), but in CI a leaked recorder with a live backend records a later test's LLM calls into the shared DB — so test_disabled_writes_ no_rows sees rows for its bank even though its own recorder is disabled (assert N == 0). Reproduced: after test_per_operation_llm_config the registry holds 8 enabled recorders. Add an autouse fixture that snapshots the registry and removes anything a test leaked. Verified: the registry drops from 8 leaked recorders back to 1. _teardown_memory_engine already guards the fixtures; this guards direct constructions. 53 trace + leak-risk tests pass together under -n2.
oliverv
pushed a commit
to oliverv/collabmind-hindsight
that referenced
this pull request
Jul 1, 2026
…TE in consolidator (vectorize-io#2425) * fix: populate search_vector on observation INSERT/UPDATE in consolidator The consolidator creates and updates observations without populating the search_vector tsvector column. Under the native text search extension, this means observations are invisible to BM25 full-text retrieval - the BM25 arm returns 0 candidates regardless of query content. Four code paths write observation text to memory_units: 1. _dedup_reconcile_create (merge into existing twin) 2. _dedup_reconcile_update (drift-merge into different twin) 3. _execute_update_action (LLM rewrite of existing observation) 4. _create_observation_directly (new observation INSERT) None populated search_vector. This patch adds conditional tsvector generation gated on config.text_search_extension == 'native', matching the existing pattern in ops_postgresql.insert_facts_batch. Non-native backends (pg_textsearch, pgroonga, pg_search) continue to leave search_vector NULL as they index base text columns directly. The INSERT path (Site 4) splits the existing else branch into an explicit elif/else to avoid applying native tsvector logic to backends that don't use it. Fixes: observations invisible to BM25 retrieval arm. * Implement test for search vector population in observations Add test for observation creation with native search vector * style: run lint * fix(consolidation): backfill search_vector for existing native observations The writer fix only populates search_vector for observations created or updated after deploy. Observations already written under the native backend keep a NULL search_vector and stay invisible to BM25 until re-consolidated. Add migration c3f7a1b9d2e4 to backfill them, gated on the native tsvector column type and scoped to fact_type='observation' with a NULL search_vector (idempotent). Matches the writer's text-only tsvector and the configured native language. * chore: remove accidentally committed git-lfs hooks post-checkout/post-commit/post-merge/pre-push were git-lfs stubs picked up from the contributor's local hookspath and committed by mistake. They are unrelated to this change; the project's real .githooks/pre-commit is left intact. --------- Co-authored-by: Nicolò Boschi <boschi1997@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The consolidator creates and updates observations in
memory_unitswithout populating thesearch_vectorcolumn. Under thenativetext search extension backend, this makes all observations invisible to BM25 full-text retrieval.This is a known gap documented in the source code itself:
This PR fills that gap.
The raw fact batch insert path (
ops_postgresql.insert_facts_batch) correctly populatessearch_vectorwithto_tsvector(). The consolidation pipeline - which produces the observation layer that recall surfaces - does not. The BM25 retrieval arm consistently returns 0 candidates for observations:After the fix:
Root cause
Four code paths in
consolidator.pywrite observation text without populatingsearch_vector:_dedup_reconcile_create_dedup_reconcile_update_execute_update_action_create_observation_directlyFix
Add conditional
search_vectorpopulation to all four sites, gated onconfig.text_search_extension == "native". This matches the existing pattern inops_postgresql.insert_facts_batch.For UPDATE paths (Sites 1-3), a
search_vector_clausestring is conditionally appended to the SET clause:For the INSERT path (Site 4), the existing
elsebranch (which bundled native + pg_textsearch + pgroonga + pg_search) is split intoelif "native"(with tsvector) andelse(without), ensuring non-native backends continue to leavesearch_vectorNULL as they index base text columns directly.Backend gating rationale
The fix is restricted to
text_search_extension == "native"because:tokenize()- already handles search_vectorto_tsvector()populationtextcolumn directly via their own extension mechanisms -search_vectoris a dummy column that should remain NULLtsvector content: observation text only
The tsvector is built from
COALESCE(text, '')- the observation's own text content only. Entity names, source fact text, and temporal metadata are intentionally excluded because:BM25's role is keyword matching on the observation's surface text - the one signal the other three arms don't cover. Enriching the tsvector with entity names or source context would create "contextual leakage" where non-central details pull general observations into narrow keyword matches.
Regression risk
None. The fix only fires when
text_search_extension == "native"- all other backends are unaffected. The tsvector is built from the observation's own text content using the configured language dictionary (config.text_search_extension_native_language), matching the existing pattern ininsert_facts_batch. No new parameters are added to any SQL query.Testing
Added
test_create_observation_populates_search_vector_nativeintests/test_consolidation.py. Asserts that observations created via consolidation havesearch_vector IS NOT NULLunder the native text search backend (Site 4 - the INSERT path, highest frequency).The UPDATE path (Site 3) is verified to work via live deployment testing but not unit-tested - triggering an UPDATE deterministically requires a real LLM making consolidation decisions (the mock LLM always creates fresh observations). Sites 1-2 (dedup reconcile) are similarly impractical to trigger without controlling embedding similarity thresholds.
Additionally verified on a live deployment with ~139 observations:
search_vector IS NULL, BM25 returns 0